As AI labs race to demonstrate the capabilities of their newest models, many have turned to crowdsourced benchmarking platforms like Chatbot Arena to showcase progress. But a growing chorus of researchers and experts says this trend may be undermining the credibility of AI evaluations and creating ethical blind spots in the process.
Over the past few years, major players like OpenAI, Google, and Meta have leaned on platforms that let everyday users evaluate models by comparing their outputs. A strong showing on one of these leaderboards often becomes a marketing win. But critics argue that these rankings are being used out of context, and often without much scientific rigor.
Emily Bender, a linguistics professor at the University of Washington and co-author of The AI Con, is one of those critics. She points specifically to Chatbot Arena, which pits two anonymous AI models against each other and asks users to pick the better response.
“To be valid, a benchmark must measure something specific — and that something needs to be clearly defined,” Bender said. “There’s no evidence that choosing one output over another actually correlates with any meaningful preference or utility.”
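For readers unfamiliar with how such rankings are produced: Chatbot Arena has described aggregating head-to-head votes into relative ratings (Elo-style scoring and, later, a Bradley-Terry model). The sketch below is purely illustrative rather than LMArena’s actual code; the model names, K-factor, and votes are hypothetical. It shows only how a stream of “user preferred A over B” clicks becomes a leaderboard number, the quantity whose meaning Bender questions.

```python
# Illustrative sketch: how pairwise "pick the better response" votes can be
# turned into a leaderboard via Elo-style updates. This is NOT LMArena's
# implementation; the model names, K-factor, and votes below are made up.

from collections import defaultdict

K = 32          # update step size (hypothetical choice)
BASE = 1000.0   # starting rating for every model

ratings = defaultdict(lambda: BASE)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record_vote(model_a: str, model_b: str, winner: str) -> None:
    """Update both ratings after a user prefers the winner's response."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    s_a = 1.0 if winner == model_a else 0.0
    ratings[model_a] += K * (s_a - e_a)
    ratings[model_b] += K * ((1.0 - s_a) - (1.0 - e_a))

# Hypothetical stream of anonymous head-to-head votes.
votes = [
    ("model-x", "model-y", "model-x"),
    ("model-x", "model-z", "model-z"),
    ("model-y", "model-z", "model-y"),
]

for a, b, w in votes:
    record_vote(a, b, w)

for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

The aggregation itself is straightforward; Bender’s objection is that the inputs, individual preference clicks, are never tied to a clearly defined quality the leaderboard is supposed to measure.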
‘Co-opted to Sell AI Hype’
Others, like Asmelash Teka Hadgu, co-founder of AI startup Lesan and a fellow at the Distributed AI Research Institute (DAIR), argue that platforms like Chatbot Arena are being “co-opted” by companies to promote overstated claims. Hadgu points to a recent controversy in which Meta fine-tuned a version of its Llama 4 Maverick model specifically to score well on Chatbot Arena, then quietly released a different, lower-performing model to the public.
“Benchmarks should be dynamic and decentralized,” Hadgu said. “They need to be tailored to specific domains like healthcare or education and overseen by independent professionals with domain expertise.”
He and Kristine Gloria, a former director at the Aspen Institute’s AI program, also raised concerns about compensation and labor ethics, warning that crowdsourced model evaluation, like the data labeling industry before it, risks exploiting unpaid contributors.
“Citizen science models have value,” Gloria said, “but we can’t rely on crowdsourced benchmarking as the sole metric. As models and use cases evolve rapidly, those benchmarks can quickly lose relevance.”
Crowdsourcing Can’t Replace Rigorous Testing
Even companies in the space acknowledge the limitations. Matt Frederikson, CEO of Gray Swan AI, which runs crowdsourced red teaming challenges (sometimes with cash prizes), said public participation brings diverse perspectives to testing, but it is not a replacement for serious vetting.
“Internal benchmarks, algorithmic red teams, and expert testing remain essential,” he said. “And both benchmark designers and model developers have a duty to communicate clearly and respond when issues are raised.”
Wei-Lin Chiang, a doctoral student at UC Berkeley and co-founder of LMArena (the group behind Chatbot Arena), defended the platform’s mission but agreed it is not a silver bullet.
“We absolutely support complementary evaluations,” Chiang said. “Our goal is to offer a community-driven space to reflect public preferences. But that’s not the same as a formal safety or performance test.”
He added that the Maverick incident stemmed not from flaws in the Arena’s design but from labs misinterpreting its policies. LMArena has since updated its guidelines to improve transparency.
“We don’t think of our community as unpaid testers,” Chiang said. “They’re here to engage with AI, learn, and provide meaningful feedback. As long as our leaderboard honestly reflects the community’s voice, we believe it has a place — just not the only place — in model evaluation.”